ggplot2 package
The package ggplot2 is extensive, rich in features, and quite formidable to master.
It implements what Hadley Wickham calls A Layered Grammar of Graphics building on Wilkinson’s Grammar of Graphics.
It is widely used.
Put succinctly, the grammar interprets a graphic as follows:
plot ::= coord scale+ facet? layer+
layer ::= data mapping stat geom position?
That is, a plot is defined by a coordinate system (coord), one or more scales (scale), an optional faceting specification (facet), and one or more layers (layer).
A layer is defined as an R data frame (data), a specification mapping columns of that frame into aesthetic properties (mapping), a statistical approach to summarize the rows of that frame (stat), a geometric object to visually represent that summary (geom), and an optional position adjustment to move overlapping geometric objects out of their way (position).
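To make the grammar concrete, here is a minimal sketch that spells out every component of a single layer using ggplot2's layer() function. The built-in mtcars data and the wt and mpg columns are assumptions chosen purely for illustration; in everyday use a shortcut such as geom_point() fills in the stat and position for you.

```r
library(ggplot2)

# One layer, with each component of the grammar written out explicitly.
# mtcars and the wt/mpg columns are illustrative choices only.
p <- ggplot() +
  layer(
    data     = mtcars,                 # data
    mapping  = aes(x = wt, y = mpg),   # mapping
    stat     = "identity",             # stat: use the rows as they are
    geom     = "point",                # geom: draw each row as a point
    position = "identity"              # position: no adjustment
  ) +
  coord_cartesian()                    # coord (the default, stated explicitly)

p
```

The scales and the (absent) faceting are left at their defaults here, which is why they do not appear in the call.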
A good way to see some examples is via example(qplot).
Let us look at the elements of a plot in a scatterplot context.
geom determines the type of plot; for example, the point geometry gives a scatterplot.
Combining all these (the data, geom, scales, coordinate system, and plot annotations) generates the scatterplot graphic.
geom = "point" draws points
geom = "smooth" fits a smoother to the data and plots the smooth plus its standard error
geom = "quantile" displays conditional quantile estimates (an extension of boxplots)
geom = "density2d" adds contours of a 2d density estimate (useful when overplotting is a concern)
geom = "path" and geom = "line" do the obvious
geom = "boxplot" produces a box and whisker plot
geom = "histogram" and geom = "density" produce histograms or density plots
Available statistics and aesthetics are all described in Hadley’s book and can also be seen in Deducer’s plot builder.
Deducer is a package that uses Java and runs inside a graphical interface such as JGR (pronounced Jaguar). It allows one to build plots interactively and writes out the resulting ggplot2 code. I have found Deducer useful for learning ggplot2 because one can examine how to build up plots.
However, the reliance on Java means that it needs to run in multi-threaded mode for GUI interaction, so you have to work harder to make it run on your platform. As mentioned in the earlier lectures, you first need to configure your R for Java (this is done only once).
Finally, install the packages Deducer and JGR (a Java GUI for R) as usual. Then the following will launch JGR as a new application.
In that application, you can now type library(Deducer) to go through a video and examples.
NOTE Deducer may not work on Windows.
As noted above, a layer is a combination of data, mappings, stat, geom, and a position adjustment. You can add as many layers as you want.
The power of ggplot is that you can build complex plots layer by layer.
I often find myself figuring out details in the context of a particular graphic I want to produce, but my general tactic is always what is stated above, building layer by layer.
If you want very simple plots, qplot is provided. Here are some parts of example(qplot).
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Here, you have the same graph drawn for subsets of data for comparison purposes (V/S versus automatic or manual transmission).
Including a jitter geom to show actual points.
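The code for these two plots is not reproduced above. A sketch along the lines of example(qplot), using the mtcars data whose structure is printed above, might look as follows. The variable names are hypothetical. Note also that qplot is, as far as I know, deprecated in recent ggplot2 releases, so it may emit a deprecation warning.

```r
library(ggplot2)

# Same scatterplot drawn for each combination of vs (engine shape) and
# am (transmission), so the subsets can be compared side by side.
p_facet <- qplot(mpg, wt, data = mtcars, facets = vs ~ am)

# Boxplots of weight by number of cylinders, with a jitter geom layered
# on top so the individual observations remain visible.
p_jitter <- qplot(factor(cyl), wt, data = mtcars,
                  geom = c("boxplot", "jitter"))

p_facet
p_jitter
```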
Let’s redo the barley dataset plot in ggplot.
The function qplot (for “quick plot”) is the easiest to use. Note that using geom = "density" above defines a stat implicitly. In fact, as of version 2.0 of ggplot2, the stat parameter of qplot is deprecated. This is also the reason why qplot can only go so far: having one simple function do everything makes its parameters interdependent, so plot construction is not completely obvious.
Side-by-side plots can be created, but beware of some details. If you try:
you’ll find that it has no effect, because grid, upon which ggplot is based, has different concepts of viewports etc. Other packages such as gridExtra have some useful functions for arranging several plots.
library(gridExtra)
p1 <- qplot(yield, data = barley, geom="density", col = site)
## Explicit invocation
p2 <- qplot(yield, data = barley, geom="histogram", fill = site)
grid.arrange(p1, p2, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Sites in the barley dataset are naturally candidates for facets.
Here is the density plot, faceted.
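The plotting code is not shown here. One way to produce such a faceted density plot is sketched below; it assumes the barley data from the lattice package and uses a hypothetical variable name.

```r
library(ggplot2)

# barley ships with the lattice package; load just the data set.
data(barley, package = "lattice")

# One density curve of yield per site, each in its own panel.
p_density <- ggplot(barley, aes(x = yield)) +
  geom_density() +
  facet_wrap(~ site)

p_density
```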
Like lattice, ggplot objects can be manipulated and incrementally built/updated.
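As a rough sketch of what this means in practice, a plot can be stored in a variable, inspected like any other R object, and extended with +, which returns a new object and leaves the original untouched. The variable names below are hypothetical, and the exact class and str() output depend on the ggplot2 version.

```r
library(ggplot2)
data(barley, package = "lattice")

# Store the plot instead of printing it.
p_base <- ggplot(barley, aes(x = yield, colour = site)) +
  geom_density()

class(p_base)    # the object's classes (version dependent)
typeof(p_base)   # the underlying storage type (version dependent)
# str(p_base)    # full structure; the output is long

# Adding components returns a new, updated plot object.
p_faceted <- p_base + facet_wrap(~ site)
p_labeled <- p_faceted + labs(x = "Yield", title = "Barley yield by site")

p_labeled
```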
## [1] "gg" "ggplot"
## [1] "list"
## List of 9
## $ data :'data.frame': 120 obs. of 4 variables:
## ..$ yield : num [1:120] 27 48.9 27.4 39.9 33 ...
## ..$ variety: Factor w/ 10 levels "Svansota","No. 462",..: 3 3 3 3 3 3 7 7 7 7 ...
## ..$ year : Factor w/ 2 levels "1932","1931": 2 2 2 2 2 2 2 2 2 2 ...
## ..$ site : Factor w/ 6 levels "Grand Rapids",..: 3 6 4 5 1 2 3 6 4 5 ...
## $ layers :List of 1
## ..$ :Classes 'LayerInstance', 'Layer', 'ggproto', 'gg' <ggproto object: Class LayerInstance, Layer, gg>
## aes_params: list
## compute_aesthetics: function
## compute_geom_1: function
## compute_geom_2: function
## compute_position: function
## compute_statistic: function
## data: waiver
## draw_geom: function
## finish_statistics: function
## geom: <ggproto object: Class GeomDensity, GeomArea, GeomRibbon, Geom, gg>
## aesthetics: function
## default_aes: list
## draw_group: function
## draw_key: function
## draw_layer: function
## draw_panel: function
## extra_params: na.rm
## handle_na: function
## non_missing_aes:
## optional_aes:
## parameters: function
## required_aes: x y
## setup_data: function
## use_defaults: function
## super: <ggproto object: Class GeomArea, GeomRibbon, Geom, gg>
## geom_params: list
## inherit.aes: TRUE
## layer_data: function
## map_statistic: function
## mapping: NULL
## position: <ggproto object: Class PositionIdentity, Position, gg>
## compute_layer: function
## compute_panel: function
## required_aes:
## setup_data: function
## setup_params: function
## super: <ggproto object: Class Position, gg>
## print: function
## setup_layer: function
## show.legend: NA
## stat: <ggproto object: Class StatDensity, Stat, gg>
## aesthetics: function
## compute_group: function
## compute_layer: function
## compute_panel: function
## default_aes: uneval
## extra_params: na.rm
## finish_layer: function
## non_missing_aes:
## parameters: function
## required_aes: x
## retransform: TRUE
## setup_data: function
## setup_params: function
## super: <ggproto object: Class Stat, gg>
## stat_params: list
## super: <ggproto object: Class Layer, gg>
## $ scales :Classes 'ScalesList', 'ggproto', 'gg' <ggproto object: Class ScalesList, gg>
## add: function
## clone: function
## find: function
## get_scales: function
## has_scale: function
## input: function
## n: function
## non_position_scales: function
## scales: list
## super: <ggproto object: Class ScalesList, gg>
## $ mapping :List of 2
## ..$ x : language ~yield
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## ..$ colour: language ~site
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## ..- attr(*, "class")= chr "uneval"
## $ theme : list()
## $ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto', 'gg' <ggproto object: Class CoordCartesian, Coord, gg>
## aspect: function
## backtransform_range: function
## clip: on
## default: TRUE
## distance: function
## expand: TRUE
## is_free: function
## is_linear: function
## labels: function
## limits: list
## modify_scales: function
## range: function
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## setup_data: function
## setup_layout: function
## setup_panel_params: function
## setup_params: function
## transform: function
## super: <ggproto object: Class CoordCartesian, Coord, gg>
## $ facet :Classes 'FacetNull', 'Facet', 'ggproto', 'gg' <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
## $ plot_env :<environment: R_GlobalEnv>
## $ labels :List of 4
## ..$ y : chr ""
## ..$ x : chr "yield"
## ..$ colour: chr "site"
## ..$ fill : chr "fill"
## - attr(*, "class")= chr [1:2] "gg" "ggplot"
Suppose we create a faceted plot.
## Warning: `stat` is deprecated
While qplot can do a lot, it is really meant to provide an analog of the plot function in base graphics. So if you want to do something more involved, you will find yourself working directly with ggplot objects.
First off, we note that the call ggplot() initializes a ggplot object.
These invocations are often used when building plots layer by layer.
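The invocations themselves are not shown above. Three common forms, sketched here with the built-in mtcars data as an assumed example, are:

```r
library(ggplot2)

# 1. Empty plot object: data and mappings are supplied by each layer.
p0 <- ggplot()

# 2. Default data only: layers supply their own mappings.
p1 <- ggplot(data = mtcars)

# 3. Default data and default mappings shared by all layers.
p2 <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# All three can be completed to the same scatterplot.
s0 <- p0 + geom_point(aes(x = wt, y = mpg), data = mtcars)
s1 <- p1 + geom_point(aes(x = wt, y = mpg))
s2 <- p2 + geom_point()
```

The first form is convenient when different layers draw on different data frames; the third keeps the code short when every layer shares the same data and mappings.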
Here is a way to build up the plot incrementally.
p <- ggplot()
p <- p + geom_point(aes(x = yield,
                        y = variety,
                        colour = year,
                        group = year),
                    data = barley)
p <- p + facet_wrap(facets = ~site)
p
Alpha blending is a way to control overplotting, to avoid splotches of color that obscure structure.
If you plot the data directly, you can barely see the points.
Using alpha blending to alleviate overplotting in sample data from a bivariate normal.
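The code behind this comparison is not shown. A minimal sketch, assuming a simulated sample with an arbitrarily chosen sample size, correlation, and alpha value, could be:

```r
library(ggplot2)

set.seed(1)                # arbitrary seed, for reproducibility only
n   <- 20000               # assumed sample size
rho <- 0.7                 # assumed correlation
x   <- rnorm(n)
y   <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # (x, y) is bivariate normal
d   <- data.frame(x = x, y = y)

# Opaque points: the centre of the cloud becomes a solid blob.
p_opaque <- ggplot(d, aes(x = x, y = y)) + geom_point()

# Mostly transparent points: dense regions appear darker.
p_alpha <- ggplot(d, aes(x = x, y = y)) + geom_point(alpha = 0.05)

p_opaque
p_alpha
```

Here alpha is set outside aes(), so it applies uniformly to every point and produces no legend.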
This is a dataset containing depression scores for transplant patients.
load("sipat-ex.RData")
p <- ggplot()
p <- p + geom_density(aes(x = score, color = txType),
                      data = sipat)
p
We can fill the curves.
To prevent obscuring, we can specify transparency (alpha). Notice that the data argument can be specified once for all plot operations, or separately for each additional layer if the data differ. Here there is only one data set.
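Since the sipat data are not generally available, the following sketch illustrates both points with the built-in iris data as a stand-in; the data set, column names, and variable names are assumptions for illustration only.

```r
library(ggplot2)

# iris is a stand-in here; any data frame with a numeric column and a
# grouping factor would do.

# Data given once in ggplot(): every layer inherits it.
p_once <- ggplot(data = iris) +
  geom_density(aes(x = Sepal.Length, fill = Species))

# Data given in the layer itself: useful when layers use different data.
p_layer <- ggplot() +
  geom_density(aes(x = Sepal.Length, fill = Species), data = iris)

# Transparency, so that overlapping curves do not hide each other.
# alpha sits inside aes() here, so ggplot2 treats it as a mapped variable.
p_alpha <- ggplot(data = iris) +
  geom_density(aes(x = Sepal.Length, fill = Species, alpha = 0.5))

p_alpha
```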
The last, unfortunately, adds a useless legend for the alpha value. To suppress that you can use the scale_alpha_continuous or the scale_alpha_discrete function in ggplot2.
p <- ggplot(data = sipat)
p <- p + geom_density(aes(x = score, fill = txType, alpha = 0.5)) +
  ## guide = "none" hides the alpha legend (guide = FALSE is deprecated in newer ggplot2)
  scale_alpha_continuous(guide = "none")
p
Similar effects can be obtained using qplot as well, as shown below. But understanding the full power of ggplot2 is worth the effort.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The book discusses an example.
In 2011, URL shortening service bit.ly partnered with the United
States government website usa.gov to provide a feed of anonymous data
gathered from users who shorten links ending with .gov or .mil. As of
this writing, in addition to providing a live feed, hourly snapshots
are available as downloadable text files.
The goal is to process these data and produce some descriptive statistics by time zone, and also a graphic showing the top 10 time zones by operating system used for browsing. The graphic in the book is the following:
I downloaded the data from the website (bitly.RDS) and we will process the data as they do and produce the graphics shown there using ggplot2.
## [1] "list"
## List of 16
## $ a : chr "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11"
## $ c : chr "US"
## $ nk: int 1
## $ tz: chr "America/New_York"
## $ gr: chr "MA"
## $ g : chr "A6qOVH"
## $ h : chr "wfLQtf"
## $ l : chr "orofrog"
## $ al: chr "en-US,en;q=0.8"
## $ hh: chr "1.usa.gov"
## $ r : chr "http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf"
## $ u : chr "http://www.ncbi.nlm.nih.gov/pubmed/22415991"
## $ t : int 1331923247
## $ hc: int 1331822918
## $ cy: chr "Danvers"
## $ ll: num [1:2] 42.6 -71
tz <- unlist(lapply(data, function(x) x$tz))
tz[tz == ""] <- "Unknown"
tzTable <- sort(table(tz), decreasing=TRUE)
head(tzTable, 3)
## tz
## America/New_York Unknown America/Chicago
## 1251 521 400
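The same kind of tabulation can be done with dplyr, which returns a data frame rather than a table. The plotting code further on relies on such a data frame, topTenTz, with columns tz and count, whose construction does not appear in this section. One plausible way to build it is sketched below on a small made-up vector of time zones; the toy data, the name topTz, and the cutoff of 2 (instead of 10) are all assumptions for illustration, not the original code.

```r
library(dplyr)

# Made-up stand-in for the tz field extracted from the records.
toy <- tibble(tz = c("America/New_York", "", "America/Chicago",
                     "America/New_York", "", "America/New_York",
                     "Europe/London"))

topTz <- toy %>%
  mutate(tz = ifelse(tz == "", "Unknown", tz)) %>%  # label missing zones
  count(tz, name = "count", sort = TRUE) %>%        # frequency per zone, descending
  head(2)                                           # keep the most frequent zones

topTz
```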
## tz
## Europe/Sofia Europe/Uzhgorod Europe/Volgograd
## 1 1 1
We replace empty time zones with "Unknown" and canonicalize the operating system, which is derived from the user agent string stored in the variable a.
library(dplyr)
getElement <- function(list, var) list[[var]]
tzAndOs <- tibble(tz = unlist(lapply(data, getElement, var = "tz")),
                  agent = unlist(lapply(data, getElement, var = "a"))) %>%
  mutate(tz = ifelse(tz == "", "Unknown", tz),
         os = ifelse(grepl("Windows", agent), "Windows", "Nonwindows")) %>%
  select(tz, os)
library(ggplot2)
## topTenTz (tz and count for the ten most frequent time zones) is assumed to exist at this point
g <- ggplot(data = topTenTz, aes(x = tz, y = count)) +
  geom_bar(stat = "identity")
g
Well, we need to flip the axes.
While we are at it, let’s also change the color.
ggplot(data = topTenTz, aes(x = tz, y = count)) +
  geom_bar(stat = "identity", color = "#E69F00", fill = "#E69F00") +
  coord_flip()
Next, we combine tzAndOs and topTenTz. So here we go. Note how we force the topTenTz time zone ordering in the last mutate below.
topTenTzAndOs <- tzAndOs %>%
  inner_join(topTenTz) %>%
  group_by(tz, os) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(tz = factor(tz, levels = rev(topTenTz$tz)))
## Joining, by = "tz"
Let us see what this looks like.
## # A tibble: 19 x 3
## tz os count
## <fct> <chr> <int>
## 1 America/Chicago Nonwindows 115
## 2 America/Chicago Windows 285
## 3 America/Denver Nonwindows 132
## 4 America/Denver Windows 59
## 5 America/Los_Angeles Nonwindows 130
## 6 America/Los_Angeles Windows 252
## 7 America/New_York Nonwindows 339
## 8 America/New_York Windows 912
## 9 America/Sao_Paulo Nonwindows 13
## 10 America/Sao_Paulo Windows 20
## 11 Asia/Tokyo Nonwindows 2
## 12 Asia/Tokyo Windows 35
## 13 Europe/London Nonwindows 43
## 14 Europe/London Windows 31
## 15 Europe/Madrid Nonwindows 16
## 16 Europe/Madrid Windows 19
## 17 Pacific/Honolulu Windows 36
## 18 Unknown Nonwindows 245
## 19 Unknown Windows 276
Note that this is a tidy data set.
library(scales)
color <- c("blue", "#E69F00")
topTenTzAndOs %>%
  ggplot(aes(x = tz, y = count, fill = os)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = color) +
  coord_flip()
If I wanted to reproduce the exact colors from the Python book:
topTenTzAndOs %>%
  ggplot(aes(x = tz, y = count, fill = os)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("blue", "red")) +
  coord_flip()
I like the ggplot graphic better!
Consider the example vignette in the package deconvolveR.
I’ll focus on one plot, the graphic for the Normal Example.
Note the difference between what is inside aes versus what is outside; i.e. geom_point(aes(x=a, y=b, color=col)) is different from geom_point(aes(x=a, y=b), color = col). The former applies on an element-by-element basis, whereas the latter applies to the entire geom.
There are tons of resources for ggplot2.
A good introduction to the ggplot approach is Hadley’s JCGS paper: A Layered Grammar of Graphics, Journal of Computational and Graphical Statistics, Volume 19, Number 1, Pages 3–28 (2010).
ggplot2 by Hadley Wickham, which is available online via Stanford libraries. However, the book is dated, so the online web pages are a good resource, as is the GitHub repo. The R blogs and Stack Overflow are also excellent resources.
I highly recommend the online book R Graphics Cookbook by Winston Chang (O’Reilly), available online via Stanford libraries. It is problem oriented and is a great resource under duress!
Ryan Rosario’s minicourse: http://www.bytemining.com/2010/01/advanced-graphics-in-r/
The Lattice reference is Lattice Graphics by Deepayan Sarkar, available via Stanford libraries.
R Graphics (2nd ed.) by Paul Murrell. Ch. 2-3 cover base graphics; the book is also available via Stanford libraries.
A very readable introduction to the use of color is Choosing Colour Palettes for Statistical Graphics by Zeileis and Hornik (2006). (Google for it.)
Other packages worth knowing about are RColorBrewer (functions colorRampPalette etc.), and scatterplot3d and rgl for 3D visualization.
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin19.2.0 (64-bit)
## Running under: macOS Catalina 10.15.3
##
## Matrix products: default
## BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] scales_1.1.0 png_0.1-7 lattice_0.20-38 gridExtra_2.3
## [5] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3
## [9] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1
## [13] tidyverse_1.3.0 rmarkdown_2.0 knitr_1.26 pkgdown_1.4.1
## [17] devtools_2.2.1 usethis_1.5.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.1 pkgload_1.0.2 jsonlite_1.6 modelr_0.1.5
## [5] assertthat_0.2.1 cellranger_1.1.0 yaml_2.2.0 remotes_2.1.0
## [9] sessioninfo_1.1.1 pillar_1.4.3 backports_1.1.5 glue_1.3.1
## [13] digest_0.6.23 rvest_0.3.5 colorspace_1.4-1 plyr_1.8.5
## [17] htmltools_0.4.0 pkgconfig_2.0.3 broom_0.5.3 haven_2.2.0
## [21] processx_3.4.1 generics_0.0.2 farver_2.0.1 ellipsis_0.3.0
## [25] withr_2.1.2 lazyeval_0.2.2 cli_2.0.1 magrittr_1.5
## [29] crayon_1.3.4 readxl_1.3.1 memoise_1.1.0 evaluate_0.14
## [33] ps_1.3.0 fs_1.3.1 fansi_0.4.1 nlme_3.1-143
## [37] MASS_7.3-51.5 xml2_1.2.2 pkgbuild_1.0.6 tools_3.6.2
## [41] prettyunits_1.1.1 hms_0.5.2 lifecycle_0.1.0 munsell_0.5.0
## [45] reprex_0.3.0 callr_3.4.1 compiler_3.6.2 rlang_0.4.4
## [49] rstudioapi_0.10 labeling_0.3 testthat_2.3.1 gtable_0.3.0
## [53] DBI_1.1.0 reshape2_1.4.3 R6_2.4.1 lubridate_1.7.4
## [57] utf8_1.1.4 rprojroot_1.3-2 desc_1.2.0 stringi_1.4.5
## [61] Rcpp_1.0.3 vctrs_0.2.2 dbplyr_1.4.2 tidyselect_0.2.5
## [65] xfun_0.11